Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
Short overview of the used data:
## [1] 4898 14
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "qual"
## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ qual : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality qual
## Min. : 8.00 3: 20 Min. :3.000
## 1st Qu.: 9.50 4: 163 1st Qu.:5.000
## Median :10.40 5:1457 Median :6.000
## Mean :10.51 6:2198 Mean :5.878
## 3rd Qu.:11.40 7: 880 3rd Qu.:6.000
## Max. :14.20 8: 175 Max. :9.000
## 9: 5
The worst quality in the dataset is 3 and the best is 9. To get a better understanding of the quality ranking we plot a histogram.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most values fall into quality groups 5 and 6. In our analysis we try to find a linear model that estimates the quality of the wine from the given parameters. Because the quality groups differ greatly in size, I would like to plot histograms for each group.
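To make the imbalance between the groups concrete, the counts from the table above can be turned into shares. The original analysis was done in R; the following is only an illustrative Python sketch, and the names `quality_counts` and `class_shares` are made up for this example.

```python
# Counts per quality score, copied from the frequency table above.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}


def class_shares(counts):
    """Return the share of each class as a fraction of all observations."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


shares = class_shares(quality_counts)
# Combined share of the three largest groups (5, 6 and 7).
middle = shares[5] + shares[6] + shares[7]

for level, share in sorted(shares.items()):
    print(f"quality {level}: {share:6.1%}")
print(f"qualities 5 to 7 together: {middle:.1%}")
```

The three middle groups hold more than 90% of all wines, so any summary over the whole dataset is dominated by them.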
##
## 2.72 2.74 2.77 2.79 2.8 2.82 2.83 2.84 2.85 2.86 2.87 2.88 2.89 2.9 2.91
## 1 1 1 3 3 1 4 1 9 9 9 11 17 31 15
## 2.92 2.93 2.94 2.95 2.96 2.97 2.98 2.99 3 3.01 3.02 3.03 3.04 3.05 3.06
## 18 38 35 26 63 32 41 68 74 49 68 78 97 89 115
## 3.07 3.08 3.09 3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19 3.2 3.21
## 79 136 92 135 126 134 117 172 136 164 124 138 145 137 95
## 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.3 3.31 3.32 3.33 3.34 3.35 3.36
## 146 116 132 114 96 88 87 82 93 79 86 49 79 48 83
## 3.37 3.38 3.39 3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 3.48 3.49 3.5 3.51
## 49 58 40 39 30 48 20 33 17 28 21 21 23 15 14
## 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59 3.6 3.61 3.62 3.63 3.64 3.65 3.66
## 17 13 14 9 8 5 5 6 7 3 1 6 2 4 5
## 3.67 3.68 3.69 3.7 3.72 3.74 3.75 3.76 3.77 3.79 3.8 3.81 3.82
## 1 2 2 1 3 2 2 2 2 1 2 1 1
From the pH distribution I can't see any trend across the quality groups.
##
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37
## 1 1 4 4 13 13 16 31 35 54 59 84 85 120 129
## 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52
## 214 151 168 139 181 161 216 178 225 172 179 166 249 140 156
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67
## 135 167 102 108 83 99 97 88 45 68 48 67 28 36 35
## 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82
## 44 30 27 18 33 12 19 22 19 16 19 16 5 5 13
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99
## 2 4 3 2 2 7 1 5 2 2 5 3 1 6 1
## 1 1.01 1.06 1.08
## 1 1 1 1
The same applies to sulphates.
##
## 0.6 0.7 0.8 0.9 0.95 1
## 2 7 25 39 4 93
You can see that approximately 50% of the values lie between 0.8 and 0.9, which looks interesting.
Using a log10 transformation we can see two classes of wine: one with less sugar and another with more sugar.
This pattern appears in all quality groups, so it looks like sweetness on its own is no quality mark.
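The idea behind the log10 view of residual sugar can be sketched as binning on the log scale, so that low and high sugar levels get comparable resolution. This is an illustrative Python sketch, not the author's R code. The helper `log10_histogram` and the bin width are assumptions, and the ten values are only the first rows shown in the `str()` output, far too few to show the two groups reliably.

```python
import math
from collections import Counter

# First ten residual.sugar values from the str() output above (g/dm^3).
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5]

BIN_WIDTH = 0.25  # bin width in log10 units (an arbitrary choice)


def log10_histogram(values, bin_width):
    """Count values in equal-width bins on the log10 scale.

    Keys are the lower bin edges, expressed in log10 units.
    """
    bins = Counter()
    for v in values:
        if v <= 0:
            raise ValueError("log10 needs strictly positive values")
        edge = math.floor(math.log10(v) / bin_width) * bin_width
        bins[edge] += 1
    return dict(bins)


hist = log10_histogram(residual_sugar, BIN_WIDTH)
for edge in sorted(hist):
    low, high = 10 ** edge, 10 ** (edge + BIN_WIDTH)
    print(f"{low:6.2f} to {high:6.2f} g/dm^3: {hist[edge]}")
```

Even in this tiny sample the values near 1.5 g/dm^3 land in a different bin from those around 7 to 8 g/dm^3, which is the kind of separation the log scale makes visible.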
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
Here it is very interesting that there is a spike at level 0.49; maybe we can find the reason for that.
Calculating the mean values for each quality group.
## Source: local data frame [7 x 6]
##
## quality mean_alcohol mean_pH mean_density mean_chlorides mean_sulphates
## 1 3 10.34500 3.187500 0.9948840 0.05430000 0.4745000
## 2 4 10.15245 3.182883 0.9942767 0.05009816 0.4761350
## 3 5 9.80884 3.168833 0.9952626 0.05154633 0.4822032
## 4 6 10.57537 3.188599 0.9939613 0.04521747 0.4911056
## 5 7 11.36794 3.213898 0.9924524 0.03819091 0.5031023
## 6 8 11.63600 3.218686 0.9922359 0.03831429 0.4862286
## 7 9 12.18000 3.308000 0.9914600 0.02740000 0.4660000
## Source: local data frame [7 x 4]
##
## quality mean_t.sulfur.diox. mean_f.sulfur.diox. mean_r.sugar.
## 1 3 170.6000 53.32500 6.392500
## 2 4 125.2791 23.35890 4.628221
## 3 5 150.9046 36.43205 7.334969
## 4 6 137.0473 35.65059 6.441606
## 5 7 125.1148 34.12557 5.186477
## 6 8 126.1657 36.72000 5.671429
## 7 9 116.0000 33.40000 4.120000
## Source: local data frame [7 x 5]
##
## quality mean_citric.acid. mean_vol.acidity. mean_fix.acidity. n
## 1 3 0.3360000 0.3332500 7.600000 20
## 2 4 0.3042331 0.3812270 7.129448 163
## 3 5 0.3376527 0.3020110 6.933974 1457
## 4 6 0.3380255 0.2605641 6.837671 2198
## 5 7 0.3256250 0.2627670 6.734716 880
## 6 8 0.3265143 0.2774000 6.657143 175
## 7 9 0.3860000 0.2980000 7.420000 5
You can see the mean values for alcohol: very good wines (quality 9) have about 12 %, very bad wines (quality 3) about 10 %.
From the pH values you can say that better wines tend to have a slightly higher pH, that is, they are slightly less acidic.
Good wines tend to have a density of about 0.991, bad ones about 0.995.
The chlorides means show that bad wines have higher values than better ones.
Later we will look at this in more detail.
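As a sanity check on the grouped table, the per-group means weighted by group size should reproduce the overall mean reported in the summary (10.51 for alcohol). The following Python sketch is illustrative only. The original grouping was done with dplyr in R, and `weighted_mean` is a made-up helper.

```python
# (quality, mean_alcohol, n) taken from the grouped summary tables above.
groups = [
    (3, 10.34500, 20),
    (4, 10.15245, 163),
    (5, 9.80884, 1457),
    (6, 10.57537, 2198),
    (7, 11.36794, 880),
    (8, 11.63600, 175),
    (9, 12.18000, 5),
]


def weighted_mean(rows):
    """Combine per-group means into one mean, weighting by group size."""
    total_n = sum(n for _, _, n in rows)
    return sum(mean * n for _, mean, n in rows) / total_n


overall = weighted_mean(groups)
print(f"overall mean alcohol: {overall:.2f} %")

# Direction of the effect: the best group has more alcohol than the worst.
worst_mean, best_mean = groups[0][1], groups[-1][1]
print(f"quality 3: {worst_mean:.2f} %, quality 9: {best_mean:.2f} %")
```

Because groups 3 and 9 contain only 20 and 5 wines, their means are much less reliable than those of the middle groups.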
The variable quality is an ordered factor variable with the following levels.
(worst) … (best)
The median of the quality is 6. In alcohol there is a spike at 9.5; in residual sugar there is also a spike at 2.
If you look back at the quality data, we saw that 1457 white wines get quality 5, 2198 wines get a 6 and 880 wines get a 7. It is interesting to see that the distribution of quality is skewed.
The distributions of alcohol, sulphates, chlorides, residual sugar and volatile acidity are also skewed. That is my subjective opinion from looking at the distribution charts.
First I changed the type of the variable quality from int to factor. In the dataset quality is the only categorical variable.
The histogram for the variable citric.acid looks strange because there is a spike at level 0.49.
I used dplyr to group the values per quality and calculate the mean values for some chosen parameters.
I plot the data between the 1% and 99% quantiles to keep a few extreme values from dominating the plots.
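Restricting a plot to the range between the 1% and 99% quantiles can be sketched as follows. This is an illustrative Python example on synthetic data, not the author's R code. The helper names are hypothetical, and Python's default interpolation rule for quantiles may differ slightly from the one R uses.

```python
import statistics


def quantile_limits(values, lower_pct=1, upper_pct=99):
    """Return the (lower, upper) percentile cut points of `values`."""
    # n=100 yields 99 cut points: index 0 is the 1% point, index 98 the 99% point.
    cuts = statistics.quantiles(values, n=100)
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def within_limits(values, limits):
    """Keep only the values that lie inside the closed interval `limits`."""
    low, high = limits
    return [v for v in values if low <= v <= high]


# Synthetic stand-in data, NOT the wine measurements.
data = list(range(1, 1001))
limits = quantile_limits(data)
kept = within_limits(data, limits)
print(f"limits: {limits}, kept {len(kept)} of {len(data)} values")
```

Roughly 2% of the observations fall outside such limits by construction, so they are hidden from the plot but still part of the dataset.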
For the parameter residual.sugar I used a log10 transformation to show that there are two groups of wines.
For a quick overview we created a correlation plot of all parameters:
The highest correlation is between residual.sugar and density.
In the univariate chapter we saw that the parameter values change between the quality groups. Now we try to find relations between the different parameters.
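The correlation values used in this section are Pearson coefficients. As a reminder of what is being computed, here is a minimal Python sketch (the report itself used R). The function `pearson_r` is written out by hand for illustration, and the five data rows are just the first values visible in the `str()` output, far too few for a meaningful estimate.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows of residual.sugar and density from the str() output.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
density = [1.001, 0.994, 0.995, 0.996, 0.996]

r = pearson_r(residual_sugar, density)
print(f"r(residual.sugar, density) on five rows: {r:.3f}")
```

The sign agrees with the full-data value of 0.839 for this pair. The magnitude from five rows should not be taken seriously.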
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## qual -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## qual -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## qual -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol qual
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## qual 0.05367788 0.43557472 1.000000000
High positive correlations can be found among free.sulfur.dioxide, total.sulfur.dioxide, residual.sugar and density; on the other side there are negative correlations between alcohol and total.sulfur.dioxide, residual.sugar, density and chlorides.
In the next step we will check the scatterplots of the four highest correlations.
## [1] -0.7801376
## [1] 0.615501
## [1] 0.8389665
## [1] -0.3071233
The biggest impact on quality comes from density (negatively correlated) and alcohol (positively correlated), so we will make boxplots for these two.
As shown in the univariate section, bad wines have a higher density than good wines.
Bad wines have less alcohol than good wines.
residual.sugar and density have a high positive correlation compared to the other correlations, so I will drop one of these parameters from the linear model.
There is a very strong negative correlation of -0.78 between alcohol and density.
Quality is correlated with density, chlorides, volatile.acidity and alcohol.
The strongest relationship for building a model to predict the quality of white wine is alcohol, with a correlation of 0.44.
In the first graph we show the alcohol for each quality group in one chart
To show the interesting relationship between alcohol, density and quality on one hand and residual.sugar, density and quality on the other hand we make this plot.
To answer the question of how sugar impacts alcohol and density, see the next plot.
In the last picture I show the density functions for alcohol, pH, density and chlorides, to show graphically what we did at the end of the univariate analysis using the group function.
From my investigation I would choose the variables alcohol and volatile.acidity to build a linear model to predict quality. If you compare that model to the linear model that uses all parameters, our two parameters explain almost as much (R-squared of 0.240 versus 0.282).
## Calls:
## lin: lm(formula = as.numeric(quality) ~ alcohol, data = wqw[, 2:13])
## lin2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity,
## data = wqw[, 2:13])
##
## =====================================
## lin lin2
## -------------------------------------
## (Intercept) 0.582*** 1.017***
## (0.098) (0.098)
## alcohol 0.313*** 0.324***
## (0.009) (0.009)
## volatile.acidity -1.979***
## (0.110)
## -------------------------------------
## R-squared 0.190 0.240
## adj. R-squared 0.190 0.240
## sigma 0.797 0.772
## F 1146.395 773.875
## p 0.000 0.000
## Log-likelihood -5839.391 -5681.776
## Deviance 3112.257 2918.264
## AIC 11684.782 11371.552
## BIC 11704.272 11397.538
## N 4898 4898
## =====================================
##
## Call:
## lm(formula = as.numeric(quality) ~ ., data = wqw[, 2:13])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.482e+02 1.880e+01 7.881 3.98e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
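Several of the fit statistics in the output above follow directly from a few numbers, which gives a quick way to check them. The Python sketch below is illustrative only and uses the standard textbook formulas. Counting the residual variance as one extra parameter for AIC and BIC is an assumption that happens to reproduce the reported values.

```python
import math


def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def aic(log_lik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


N = 4898  # number of white wines

# Full model: R-squared 0.2819 with 11 predictors.
print(f"adjusted R-squared: {adjusted_r_squared(0.2819, N, 11):.4f}")

# Model lin: intercept + alcohol slope + residual variance = 3 parameters.
print(f"AIC lin: {aic(-5839.391, 3):.3f}")
print(f"BIC lin: {bic(-5839.391, 3, N):.3f}")

# Model lin2 adds volatile.acidity, so 4 parameters.
print(f"AIC lin2: {aic(-5681.776, 4):.3f}")
```

Lower AIC and BIC for lin2 than for lin indicate that adding volatile.acidity improves the fit by more than the penalty for the extra parameter.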
The most important parameters for predicting quality are alcohol and volatile.acidity, as shown in the linear model.
Yes, it was very surprising that all parameters that are highly correlated with quality also have a high correlation with alcohol, for example:
quality <-> total.sulfur.dioxide <-> alcohol
quality <-> density <-> alcohol
quality <-> chlorides <-> alcohol
Yes, I built a linear model with the parameters alcohol and volatile.acidity.
The linear model has an R-squared of 0.24, which is a very weak fit.
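To see what the two-parameter model predicts, its coefficients can be applied by hand. The Python sketch below is illustrative and rests on one assumption about the R call: the model was fitted on `as.numeric(quality)`, where quality is a factor with levels "3" to "9", so the response is the level code 1 to 7 rather than the score itself. The `str()` output, which shows score 6 stored as code 4, is consistent with that reading. The helper names and the offset constant are hypothetical.

```python
# Coefficients of model lin2 as reported in the model table.
INTERCEPT = 1.017
B_ALCOHOL = 0.324
B_VOLATILE_ACIDITY = -1.979

# Assumption: the response is the factor level code (1..7 for scores 3..9),
# so the quality score is the code plus 2.
LEVEL_OFFSET = 2


def predict_level_code(alcohol, volatile_acidity):
    """Linear predictor on the scale the model was fitted on."""
    return INTERCEPT + B_ALCOHOL * alcohol + B_VOLATILE_ACIDITY * volatile_acidity


def predict_quality(alcohol, volatile_acidity):
    """Predicted quality score, shifted back from level code to score."""
    return predict_level_code(alcohol, volatile_acidity) + LEVEL_OFFSET


# First wine in the data: alcohol 8.8 %, volatile acidity 0.27 g/dm^3, quality 6.
prediction = predict_quality(8.8, 0.27)
print(f"predicted quality: {prediction:.2f} (observed: 6)")
```

With a residual standard error of about 0.77, a miss of this size is typical for the model.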
It was a nice experience to work with this dataset. At the beginning I was happy to deal with no factors; on a second look I realized that the variable quality is a factor that is stored as an integer. First I plotted all histograms to get an idea of the dataset. The quality scale runs from 0 to 10, but only seven levels (3 to 9) occur in this dataset, and only three of them (5, 6 and 7) have more than 800 observations each. The second part analyzes the correlations; I was very surprised that alcohol and quality are highly correlated with the same parameters. That makes it very hard to choose the parameters for a linear model. At first I thought pH, sugar and alcohol were the main parameters, but the data tell a different story. By choosing the parameters alcohol and volatile.acidity I created a linear model with an R-squared of 0.24, which is a very weak result. Reasons for that could be that the dataset has too few values in the outer quality categories to give a representative result, or that the subjective parameter quality does not fit well with the chemical parameters. It would be very interesting to get a dataset with more data per quality level.